Debugging Data in R

Hannah Flaherty
2018-06-14

Debugging is hard

There are some great tools in R

  • print()
  • browser()
  • debug()
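As a rough illustration of how the three tools differ, the sketch below uses a hypothetical function, mean_weight(), that is not part of the original slides. The interactive calls are wrapped in interactive() so the script also runs unattended.

```r
# Hypothetical function with a problem to track down: it returns NA
# whenever one of the weights is missing.
mean_weight <- function(weights) {
  # print(): the cheapest option, shows the value at this point in the run
  print(weights)
  # browser(): pauses here so variables can be inspected at a prompt
  if (interactive()) browser()
  mean(weights)
}

result <- mean_weight(c(12, 55, NA, 8))

# debug(): flags a function so that every call steps through it line by line;
# undebug() removes the flag again
if (interactive()) {
  debug(mean_weight)
  mean_weight(c(12, 55, NA, 8))
  undebug(mean_weight)
}
```

print() needs no setup but has to be removed afterwards; browser() and debug() leave the output clean but only make sense in an interactive session.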

And RStudio has helpful options

  • breakpoints - act like browser()

  • “Break in Code” - drops into debug mode when an error occurs
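Outside RStudio, base R offers a comparable setting: options(error = recover) opens a call-stack browser whenever an error is raised. The following is a sketch of setting and restoring that option, not a description of how RStudio implements its menu item.

```r
# Install recover() as the error handler and remember the previous setting.
old_opts <- options(error = recover)
handler_while_set <- getOption("error")

# ... run the code you are investigating here; in an interactive session
# an error now offers a menu of stack frames to inspect ...

# Restore whatever handler was active before.
options(old_opts)
```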

But debugging in R means debugging data too

head(perfect_dataset)
# A tibble: 6 x 7
     id species    name   age  legs weight        color
  <int>   <chr>   <chr> <dbl> <int>  <int>        <chr>
1     1     cat     Ada     9     4     12  white, gray
2     2     dog  Mobius    11     4     55        black
3     3     dog    Mazy   119     4     40 black, white
4     4     cat     Bea     7     4      8 black, white
5     5     dog   Judge     8     4     75 brown, black
6     6     cat Visitor    15     4     10  gray, white

summary() is a good start

Great for a quick look at the class and numeric range of each column, and for getting a sense of missing values.

summary(perfect_dataset)
       id          species              name                age        
 Min.   : 1.00   Length:19          Length:19          Min.   :  1.00  
 1st Qu.: 5.50   Class :character   Class :character   1st Qu.:  2.75  
 Median :10.00   Mode  :character   Mode  :character   Median :  6.00  
 Mean   :10.37                                         Mean   : 11.97  
 3rd Qu.:15.50                                         3rd Qu.:  8.50  
 Max.   :20.00                                         Max.   :119.00  

      legs           weight         color          
 Min.   :0.000   Min.   : 2.00   Length:19         
 1st Qu.:4.000   1st Qu.:10.00   Class :character  
 Median :4.000   Median :15.00   Mode  :character  
 Mean   :3.684   Mean   :25.06                     
 3rd Qu.:4.000   3rd Qu.:40.00                     
 Max.   :4.000   Max.   :75.00                     
                 NA's   :2                         

Not so helpful for character fields, though: it reports only their length, class, and mode.
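For character columns, a frequency table usually says more than summary() does. The sketch below uses base R on a small stand-in for perfect_dataset, built from the six rows that head() printed earlier; the name pets is an assumption made for this example.

```r
# Stand-in data: the six rows shown by head(perfect_dataset).
pets <- data.frame(
  id      = 1:6,
  species = c("cat", "dog", "dog", "cat", "dog", "cat"),
  name    = c("Ada", "Mobius", "Mazy", "Bea", "Judge", "Visitor"),
  age     = c(9, 11, 119, 7, 8, 15),
  legs    = rep(4L, 6),
  weight  = c(12L, 55L, 40L, 8L, 75L, 10L),
  color   = c("white, gray", "black", "black, white",
              "black, white", "brown, black", "gray, white"),
  stringsAsFactors = FALSE
)

# How often does each value occur in a character column?
species_counts <- table(pets$species)
print(species_counts)
print(sort(table(pets$color), decreasing = TRUE))

# Missing values per column, character columns included.
n_missing <- colSums(is.na(pets))
print(n_missing)
```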

View()

Sometimes you just want to look at your data!

View(perfect_dataset)

View()

But what if it's too much data?

View() lets you look at a sliced-up data frame

View(world_cities_pop %>% group_by(Country) %>% summarise(num_cities = length(City)))

And you don't have to save it in your environment
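The same idea can be sketched in base R. The contents of world_cities_pop are not shown in the slides, so the three-row cities data frame below is a hypothetical stand-in that only borrows the Country and City column names.

```r
# Hypothetical stand-in for world_cities_pop.
cities <- data.frame(
  Country = c("ie", "ie", "fr"),
  City    = c("dublin", "cork", "paris"),
  stringsAsFactors = FALSE
)

# Count cities per country (a base R counterpart of group_by + summarise).
cities_per_country <- aggregate(City ~ Country, data = cities, FUN = length)
names(cities_per_country)[2] <- "num_cities"
print(cities_per_country)

# View() needs an interactive session; the summary can be passed in directly
# without being assigned to a variable first.
if (interactive()) {
  View(aggregate(City ~ Country, data = cities, FUN = length))
}
```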

Rules for data, and life

1. Don't trust anyone

Verify all assertions about your dataset.

“All animals in this dataset are mammals.”

perfect_dataset %>% 
  group_by(species) %>%
  summarise(length(species))
# A tibble: 3 x 2
  species `length(species)`
    <chr>             <int>
1     cat                10
2     dog                 8
3   snake                 1

Almost…
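A claim like this can also be checked in code rather than by eye. The sketch below rebuilds the species column from the counts in the table above; the list of accepted mammals is an assumption made for the example.

```r
# Species column reconstructed from the counts above: 10 cats, 8 dogs, 1 snake.
species <- rep(c("cat", "dog", "snake"), times = c(10, 8, 1))

# Assumption: the species we are willing to accept as mammals.
mammals <- c("cat", "dog")

# Which species contradict the claim?
not_mammals <- setdiff(unique(species), mammals)
all_mammals <- length(not_mammals) == 0

print(not_mammals)
print(all_mammals)
```

In a pipeline, stopifnot(all_mammals) would turn the claim into a hard check that fails as soon as the data stops matching it.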

2. Use common sense

After your manipulation, does it look right? Does it make sense with what you were expecting?

perfect_dataset %>%
  group_by(species) %>%
  summarise(mean(age))
# A tibble: 3 x 2
  species `mean(age)`
    <chr>       <dbl>
1     cat      6.9000
2     dog     19.6875
3   snake      1.0000

Wait, what?

# A tibble: 6 x 7
     id species    name   age  legs weight        color
  <int>   <chr>   <chr> <dbl> <int>  <int>        <chr>
1     3     dog    Mazy   119     4     40 black, white
2     6     cat Visitor    15     4     10  gray, white
3     2     dog  Mobius    11     4     55        black
4     1     cat     Ada     9     4     12  white, gray
5    19     cat   Henry     9     4     15 black, white
6     5     dog   Judge     8     4     75 brown, black
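The table above appears to be the dataset sorted by age, oldest first; the call that produced it is not shown. One way to get a similar view, sketched in base R on a stand-in built from the six rows that head() printed earlier (so Henry, who is not among those rows, does not appear):

```r
# Stand-in data: name, species, and age from the six head() rows.
pets <- data.frame(
  name    = c("Ada", "Mobius", "Mazy", "Bea", "Judge", "Visitor"),
  species = c("cat", "dog", "dog", "cat", "dog", "cat"),
  age     = c(9, 11, 119, 7, 8, 15),
  stringsAsFactors = FALSE
)

# Sort by age, largest first, so suspicious values float to the top.
oldest_first <- pets[order(pets$age, decreasing = TRUE), ]
print(head(oldest_first))
```

A single extreme value such as an age of 119 is enough to drag a group mean far from what the other rows suggest.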

3. Sometimes anomalies are the truth

perfect_dataset %>%
  group_by(species) %>%
  summarise(mean(legs))
# A tibble: 3 x 2
  species `mean(legs)`
    <chr>        <dbl>
1     cat        3.900
2     dog        3.875
3   snake        0.000

tinytony
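The means above can be turned back into leg counts with a little arithmetic. The sketch below takes the group sizes from the species table earlier and the means from the table above, and asks how many legs each group is short of four per animal.

```r
# Group sizes and mean leg counts as reported in the slides.
n_animals <- c(cat = 10, dog = 8, snake = 1)
mean_legs <- c(cat = 3.9, dog = 3.875, snake = 0)

# Total legs per group; round() guards against floating-point noise.
total_legs <- round(n_animals * mean_legs)

# How many legs short of "four each" is every group?
legs_short_of_four <- n_animals * 4 - total_legs
print(legs_short_of_four)
```

One three-legged cat and one three-legged dog would explain these numbers, although other combinations are possible. Either way, the rows deserve a look before anyone "corrects" them.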